REST: Add retry and timeout configuration for REST catalog#3418
REST: Add retry and timeout configuration for REST catalog#3418rahulsmahadev wants to merge 3 commits into
Conversation
Closes apache#2772. The REST Catalog uses requests with no retries and no timeout by default, so transient 5xx/network failures bubble up immediately and slow servers can hang the client indefinitely (e.g. a Polaris instance returning 504 from a proxy). Add an optional connection: block on the catalog properties: catalog: default: uri: http://rest-catalog/ws/ connection: timeout: 60 retry: total: 5 backoff_factor: 1.0 status_forcelist: [429, 500, 502, 503, 504] allowed_methods: [GET, HEAD, OPTIONS] connection.retry is passed verbatim to urllib3.util.retry.Retry. Both keys are optional and opt-in: when neither is set the default requests behavior is preserved. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Can you take a look at the linter failures? 'make lint' will help you run the linters locally. |
| | Key | Example | Description | | ||
| | ---------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------ | | ||
| | connection.timeout | 60 | Per-request timeout in seconds. Must be a positive number. | | ||
| | connection.retry | `{total: 5, backoff_factor: 1.0}` | Mapping passed verbatim as kwargs to [`urllib3.util.retry.Retry`](https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry). | |
There was a problem hiding this comment.
I feel like having a dictionary here is a bit tricky; this couples the configuration tightly with urllib3. When urllib changes something, or we want to use another library, then we have to map this. How about adding the options explicitly.
There was a problem hiding this comment.
+1 on this.
Allowing so many retry options can also lead to unintended behaviors. As an example, what happens if you set raise_on_status to False, but you receive a 403 that you should raise an error on?
I think we should focus on timeout, # of retries, and a backoff factor.
There was a problem hiding this comment.
Good call — dropped the dict pass-through. The connection block now exposes only three explicit options: timeout, retries, and backoff-factor. The status_forcelist and allowed_methods are hard-coded internally to transient codes (429/500/502/503/504) and idempotent methods (GET/HEAD/OPTIONS), so users can't accidentally swallow non-transient errors or retry non-idempotent writes. Fixed in 6fb87ff.
| assert "Could not find the TLS certificate file, invalid path: path_to_client_cert" in str(e.value) | ||
|
|
||
|
|
||
| def test_session_without_connection_config_uses_default_adapter(rest_mock: Mocker) -> None: |
There was a problem hiding this comment.
Can we get a test where we set the retry logic and then see the retries occur? We should be able to simulate this with mock HTTP calls and then see that X number of HTTP calls were made afterwards.
There was a problem hiding this comment.
Added test_session_retries_on_transient_5xx_then_succeeds in 6fb87ff. requests_mock actually replaces the HTTPAdapter on the session, which bypasses our retry logic, so the test instead stands up a real http.server on a loopback port. The handler returns three 503s followed by a 200, and the test asserts both that list_namespaces succeeds and that the handler saw 4 requests.
Per @Fokko + @rambleraptor: drop the urllib3 retry-dict pass-through and expose only three explicit knobs on the connection block (timeout, retries, backoff-factor). Hard-code the retry policy (status_forcelist of transient codes; allowed_methods of idempotent verbs) so users cannot misconfigure e.g. raise_on_status=False and silently swallow 4xx errors. Per @rambleraptor: add a test that exercises the retry path end-to-end by spinning up a loopback HTTP server that returns three 503s then a 200, verifying the catalog makes all four attempts. requests_mock can't be used here because it replaces the HTTPAdapter and bypasses retry logic. Fix the three mypy errors flagged by CI: - _RetryTimeoutHTTPAdapter.send now matches HTTPAdapter.send's full signature instead of (request, **kwargs). - Test's set(adapter.max_retries.allowed_methods) now guards the Collection[str] | None type. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Pushed 6fb87ff addressing the review feedback:
PTAL! |
|
Can you run |
requests' HTTPAdapter.send declares proxies as Mapping[str, str] | None (via types-requests). Overriding with the more specific MutableMapping is an invalid override under mypy 1.18 strict mode. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Linters are now passing — |
Rationale for this change
The REST Catalog uses
requestswith no retries and no timeout configured, so transient 5xx / network failures bubble up immediately and slow servers can hang the client indefinitely. The reporting user on #2772 hit infinite-timeout hangs against a Polaris instance returning504from a proxy.This change adds an optional
connection:block on the catalog properties for opting in to a per-request timeout and a retry policy, as proposed in the issue body:Changes:
connection,connection.timeout,connection.retrycatalog properties._RetryTimeoutHTTPAdapterthat subclassesrequests.adapters.HTTPAdapterand applies a default timeout to every call (sincerequestsdoes not expose a session-wide timeout).connection.retrysub-mapping is passed verbatim tourllib3.util.retry.Retry, so any kwarg the upstream class accepts can be configured.requestsbehavior is preserved — no behavior change for existing users.timeoutraisesValueError; unknownRetrykwargs are caught and re-raised asValueErrorwith a clearer message.The SigV4 adapter mounted at the catalog URI (see #3063) is a longer-prefix match and still wins for that host, so SigV4 users continue to get their existing retry behavior unchanged — this change deliberately stays out of that path.
Are these changes tested?
Yes. Added five tests in
tests/catalog/test_rest.py:test_session_without_connection_config_uses_default_adapter— noconnection:block → no_RetryTimeoutHTTPAdapteris mounted (defaultrequestsbehavior preserved).test_session_with_connection_timeout_and_retry— full nestedconnection:block applies timeout and allRetrykwargs (total,backoff_factor,status_forcelist,allowed_methods).test_session_with_connection_timeout_only—timeoutalone works without aretryblock.test_session_with_invalid_connection_timeout_raises— negativetimeoutraisesValueError.test_session_with_invalid_connection_retry_kwarg_raises— unknownRetrykwarg raisesValueError.Are there any user-facing changes?
Yes — two new optional REST catalog properties (
connection.timeout,connection.retry) documented inmkdocs/docs/configuration.mdunder "Retry and timeout". No behavior change for users who do not set them.